# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Mon Dec 11 16:31:06 2023"
The course started quite well and smoothly. I liked the
introduction by professor Kimmo, and generally everything looks well
scheduled and organised. From the course I expect to improve my
bioinformatics and data science knowledge. I already have quite a
bit of R experience, but I'm sure I will
improve it further with this course: I really like programming, and
R is quite a cool language. For
sure, it shines when used with RStudio. I actually didn't know about
collaborating through Git!
I heard about this course from a colleague of mine.
Here's the link to my GitHub
repository.
date()
## [1] "Mon Dec 11 16:31:06 2023"
# load the GGally and ggplot2 libraries
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
# read students2014 dataset after manipulation (from data wrangling)
data <- read.csv("analysis.csv", header = T, row.names = 1)
# explore the dataset
str(data)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(data)
## [1] 166 7
The dataset is composed of 166 rows and 7 columns.
It is a subset of a bigger dataset reporting the results of a student
survey. Each observation (i.e. row) is a student and the variables (i.e.
columns) describe the students: gender, Age, Attitude, deep, stra, surf and Points.
# visualize all of the variables and relationships in a single plot
ggpairs(data, mapping = aes(col = gender, alpha = 0.3),
lower = list(combo = wrap("facethist", bins = 20)))
From the visualization we can see that the distributions of
the variables do not differ much when gender is taken
into account. Only the Age variable has more outliers for females
than for males.
Two variable pairs show a significant correlation: positive for
Attitude and Points, and negative for surf
and deep. Interestingly, in the latter case (surf vs
deep) the negative correlation is significant only for
males.
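The correlations read off the plot can also be checked numerically; a quick sketch using the `data` object loaded above:

```r
# correlation matrix of the numeric variables (the gender column is dropped)
cor_mat <- cor(data[, -1])
# correlations with the target variable, ordered by absolute value
round(sort(abs(cor_mat[, "Points"]), decreasing = TRUE), 2)
```

This orders the candidate explanatory variables by the strength of their linear association with Points.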
# fit a model where points is the outcome variable and attitude, stra and surf are
# the explanatory variables
fit <- lm(Points ~ Attitude + stra + surf, data = data)
# summary of fitted model
summary(fit)
##
## Call:
## lm(formula = Points ~ Attitude + stra + surf, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## Attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
As the three explanatory variables I selected Attitude,
stra and surf, since they show the highest
absolute correlation with the target variable.
The summary gives the residual distribution, the coefficient estimates with their standard errors and significance, and the R-squared values of the model.
Looking at the individual explanatory variables, only Attitude, with a p-value of ~1.9e-8, gives an important contribution to the model. The other variables do not show a significant p-value, so I remove them from the model, one at a time.
# fit a model where points is the outcome variable and attitude and stra are the
# explanatory variables
fit_2 <- lm(Points ~ Attitude + stra, data = data)
# summary of fitted model
summary(fit_2)
##
## Call:
## lm(formula = Points ~ Attitude + stra, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## Attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
# fit a model where points is the outcome variable and attitude is the explanatory
# variable.
fit_1 <- lm(Points ~ Attitude, data = data)
# summary of fitted model
summary(fit_1)
##
## Call:
## lm(formula = Points ~ Attitude, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## Attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
# plot the two variables
ggplot(data = data, aes(x = Attitude, y = Points)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
The explanatory variable Attitude is significantly associated with the target variable Points (p-value = 4.12e-09). Considering the formula \(y_i = \beta_0 + \beta_1 x_i\), the summary shows that our model has \(\beta_0\) = 11.6372 and \(\beta_1\) = 3.5255. This means that for each +1 in the explanatory variable \(x\), the target variable \(y\) increases by 3.5255.
The fitted model has a Multiple R-squared of 0.1906, meaning that ~19% of the variance of the target variable is explained by the explanatory variable.
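To make the interpretation of \(\beta_1\) concrete, the fitted line can be evaluated for a few hypothetical attitude scores (the values 2, 3 and 4 are made up for illustration):

```r
# predicted exam points for three hypothetical attitude scores
new_students <- data.frame(Attitude = c(2, 3, 4))
predict(fit_1, newdata = new_students)
# each +1 in Attitude adds the slope, ~3.53 points, to the prediction
```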
# produce Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage
# diagnostic plots.
par(mfrow = c(2,2))
plot(fit_1, which = c(1,2,5))
The Residuals vs Fitted plot displays the residuals against the fitted values, showing whether the residuals follow a non-linear pattern. If the values are spread evenly around the horizontal line without any pattern, there are no non-linear relationships in the data.
The Normal Q-Q plot shows whether the residuals are normally distributed: if they are, they fall on the dashed line.
The Residuals vs Leverage plot displays the standardized residuals against leverage. In this plot, observations should not fall outside a Cook's distance of 1, as such points would have an undue influence on the model.
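Cook's distances can also be checked numerically instead of reading them off the plot; a short sketch reusing `fit_1`:

```r
# Cook's distance for every observation of the fitted model
cd <- cooks.distance(fit_1)
# flag influential observations using the common rule of thumb D > 1
which(cd > 1)
# the three largest distances, to see how close any point gets to 1
head(sort(cd, decreasing = TRUE), 3)
```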
date()
## [1] "Mon Dec 11 16:31:15 2023"
# load libraries
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(boot)
library(tidyr)
# load the alc dataset
alc <- read.table("data/alc.csv", sep = '\t', header = T)
# print out the names of the variables and describe the data set briefly
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The dataset is the result of merging two questionnaires with
identical structure, including student grades and demographic, social and
school-related features, related to secondary school student alcohol
consumption in Portugal. The two original questionnaires report the
performance in two distinct subjects: mathematics and Portuguese language.
They were merged using the common variables (columns);
the questionnaire-specific columns were combined by taking the mean
for numeric data and the mathematics questionnaire's answer for
non-numeric data.
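As an illustration of that combining rule (hypothetical names: `math_por` for the joined data, suffixes `.math`/`.por` for the duplicated columns), the per-column logic could look like:

```r
# hypothetical helper: given the joined questionnaires, combine one
# duplicated column by averaging numeric answers and keeping the
# mathematics answer otherwise
combine_duplicates <- function(math_por, col) {
  two <- math_por[, paste0(col, c(".math", ".por"))]
  if (is.numeric(two[[1]])) round(rowMeans(two)) else two[[1]]
}
```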
# draw a bar plot of each variable
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
Looking at the variables, here are my four hypotheses about possible alcohol consumption relationships: male students drink more (sex), older students drink more (age), students who study more drink less (studytime), and family educational support reduces drinking (famsup).
# sex
data <- data.frame(sex = c("M", "F"),
high_alc = c(nrow(alc[alc$sex == "M" & alc$high_use == TRUE,]),
nrow(alc[alc$sex == "F" & alc$high_use == TRUE,])),
low_alc = c(nrow(alc[alc$sex == "M" & alc$high_use == FALSE,]),
nrow(alc[alc$sex == "F" & alc$high_use == FALSE,])))
rownames(data) <- data[,1]
data <- data[,-1]
barplot(as.matrix(data) , beside=T,col=c("blue" , "red") , ylab="")
# age
ggplot(alc, aes(x = high_use, y = age)) +
geom_boxplot() + xlab("high alcohol consumption") + ylab("age") +
theme_classic()
# studytime
ggplot(alc, aes(x = high_use, y = studytime)) +
geom_boxplot() + xlab("high alcohol consumption") + ylab("studytime") +
theme_classic()
# famsup
data <- data.frame(famsup = c("yes", "no"),
high_alc = c(nrow(alc[alc$famsup == "yes" & alc$high_use == TRUE,]),
nrow(alc[alc$famsup == "no" & alc$high_use == TRUE,])),
low_alc = c(nrow(alc[alc$famsup == "yes" & alc$high_use == FALSE,]),
nrow(alc[alc$famsup == "no" & alc$high_use == FALSE,])))
rownames(data) <- data[,1]
data <- data[,-1]
barplot(as.matrix(data) , legend.text = T, beside=T,col=c("darkgreen" , "purple"),
ylab="", main= "famsup")
In line with my hypothesis, male students seem to have a higher
alcohol consumption than their female peers, so sex plays
an important role in predicting alcohol abuse. The same goes for
age: older students seem to have a higher alcohol
consumption.
Interestingly, studytime looks like a very important predictor of
alcohol abuse, even more so than sex.
The effect of famsup on consumption is hard to establish
from the plot: some effect seems to be present, but it is not
as clear as for the other predictors.
fit <- glm(high_use ~ sex + age + studytime + famsup , data = alc, family = "binomial")
summary(fit)
##
## Call:
## glm(formula = high_use ~ sex + age + studytime + famsup, family = "binomial",
## data = alc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.8395 1.7471 -2.198 0.02797 *
## sexM 0.7268 0.2502 2.905 0.00367 **
## age 0.2120 0.1015 2.088 0.03681 *
## studytime -0.5076 0.1614 -3.144 0.00167 **
## famsupyes 0.1393 0.2501 0.557 0.57747
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 420.92 on 365 degrees of freedom
## AIC: 430.92
##
## Number of Fisher Scoring iterations: 4
OR <- coef(fit) %>% exp
# compute confidence intervals (CI)
CI <- confint(fit) %>% exp
## Waiting for profiling to be done...
CI
## 2.5 % 97.5 %
## (Intercept) 0.0006667136 0.6396151
## sexM 1.2698828036 3.3925294
## age 1.0150125081 1.5128364
## studytime 0.4341702473 0.8188937
## famsupyes 0.7072097008 1.8883245
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.02150485 0.0006667136 0.6396151
## sexM 2.06838548 1.2698828036 3.3925294
## age 1.23609527 1.0150125081 1.5128364
## studytime 0.60193318 0.4341702473 0.8188937
## famsupyes 1.14948947 0.7072097008 1.8883245
As observed from the data visualization, sex, age and studytime are important predictors of alcohol consumption. Male students are indeed significantly more prone to alcohol abuse than females. Age correlates too: the older the students are, the higher the alcohol consumption gets. Furthermore, as hypothesized and observed from the plots, studytime inversely correlates with alcohol abuse. Famsup, on the other hand, is not associated with consumption: a minor effect is observed, but it is not statistically significant (p-value = 0.58).
fit1 <- glm(high_use ~ sex + age + studytime, data = alc, family = "binomial")
summary(fit1)
##
## Call:
## glm(formula = high_use ~ sex + age + studytime, family = "binomial",
## data = alc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.6554 1.7142 -2.132 0.03297 *
## sexM 0.7072 0.2474 2.858 0.00426 **
## age 0.2054 0.1008 2.037 0.04161 *
## studytime -0.4972 0.1601 -3.105 0.00190 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 421.24 on 366 degrees of freedom
## AIC: 429.24
##
## Number of Fisher Scoring iterations: 4
# predict() the probability of high_use
probabilities <- predict(fit1, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 244 15
## TRUE 92 19
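From this cross-tabulation, overall accuracy and per-class rates can be computed directly (a quick sketch reusing the columns added to `alc` above):

```r
# rebuild the confusion matrix and compute summary rates from it
conf <- table(high_use = alc$high_use, prediction = alc$prediction)
# overall accuracy: correctly classified students on the diagonal
sum(diag(conf)) / sum(conf)
# sensitivity: the share of true high users that the model detects
conf["TRUE", "TRUE"] / sum(conf["TRUE", ])
```

Accuracy here is simply one minus the mean prediction error computed with the loss function.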
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2891892
# K-fold cross-validation
cv <- cv.glm(data = alc, cost = loss_func, glmfit = fit1, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2945946
The prediction error of my model using 10-fold cross-validation is
~0.29, meaning that my model is worse than the one introduced in the
Exercise Set.
A better selection of predictor variables is needed to improve my
model.
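One way to search for a better set of predictors (an illustrative sketch, not the Exercise Set's approach) is stepwise selection by AIC over a few extra candidate variables from `alc`:

```r
# fit a richer candidate model, then let step() drop terms that do not
# improve the AIC; goout, absences and famrel are illustrative picks
full <- glm(high_use ~ sex + age + studytime + goout + absences + famrel,
            data = alc, family = "binomial")
step(full, trace = 0)
```

The retained model could then be compared with `fit1` using the same cross-validation.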
date()
## [1] "Mon Dec 11 16:31:20 2023"
# load libraries
library(tibble)
library(readr)
library(GGally)
library(FactoMineR)
library(tidyr)
library(dplyr)
# load the Human data
human <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.csv")
## Rows: 155 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (8): Edu2.FM, Labo.FM, Life.Exp, Edu.Exp, GNI, Mat.Mor, Ado.Birth, Parli.F
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# move the country names to rownames
human <- column_to_rownames(human, "Country")
# visualize the 'human' variables
ggpairs(human, progress = FALSE)
The dataset contains 8 quality-of-life variables
for 155 countries.
Looking at the pairs plot, we can see that many variables
correlate with each other. Seven pairs show a strong,
significant positive correlation, namely
Life.Exp vs Edu2.FM,
Edu.Exp vs Edu2.FM,
GNI vs Edu2.FM,
Edu.Exp vs Life.Exp,
GNI vs Life.Exp, GNI
vs Edu.Exp and Ado.Birth vs
Mat.Mor. Three of these involve Edu2.FM, the
ratio of females to males with at least secondary education: countries
where women's secondary education is high relative to men's also have a
higher Life expectancy at birth, Expected years of
schooling and Gross National Income per capita.
Conversely, 8 pairs show a strong, significant negative
correlation, namely Mat.Mor vs
Edu2.FM, Ado.Birth vs
Edu2.FM, Mat.Mor vs
Life.Exp, Ado.Birth vs
Life.Exp, Mat.Mor vs
Edu.Exp, Ado.Birth vs
Edu.Exp, Mat.Mor vs
GNI and Ado.Birth vs
GNI.
# perform PCA
pca_res <- prcomp(human)
# variability captured by the PCs
s <- summary(pca_res)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
# biplot
biplot(pca_res, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
# scale human data
human_scaled <- scale(human)
# perform PCA
pca_res_scaled <- prcomp(human_scaled)
# variability captured by the PCs
s_scaled <- summary(pca_res_scaled)
# biplot
biplot(pca_res_scaled, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
This second PCA looks much better. The first PCA, computed on
non-scaled data, is not very informative: prcomp maximises variance, so
GNI, whose values are orders of magnitude larger than the
other variables', dominates completely, making the other arrows (and the
country names) barely recognizable.
After scaling (second PCA), all the variables are on the same scale,
which makes the result interpretable.
From the scaled PCA plot, I can see that PC1 (x axis)
accounts for the variability of most of the variables:
Mat.Mor and Ado.Birth on the positive
side, and Edu2.FM, Edu.Exp,
GNI and Life.Exp on the negative side.
PC2 (y axis), on the other hand, accounts for the variability of
Parli.F and Labo.FM, both on the
positive side.
Focusing on PC1, it is interesting to note that the variables
describing development in different aspects of society, like life
expectancy, GNI or expected years of schooling, point towards the
negative side, meaning that more developed countries are placed on the
left of the plot (of course, positive and negative mean nothing in
themselves here; I'm just noting that the "good" variables point left).
With that in mind, we can see that both maternal mortality ratio and
adolescent birth rate point towards the right, opposite to
the "good" measurements I mentioned before. While that is not
surprising for maternal mortality ratio, the fact that adolescent
birth rate points opposite to the "good" measures
suggests that people in developed countries tend to have children
later than in less developed ones.
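The directions read off the biplot can be confirmed numerically from the loadings of the scaled PCA:

```r
# loadings (rotation matrix) of the first two principal components;
# the sign of each entry matches the direction of the arrow in the biplot
round(pca_res_scaled$rotation[, 1:2], 2)
```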
# load tea dataset
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv",
stringsAsFactors = TRUE)
dim(tea)
## [1] 300 36
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
# select the 'keep_columns' to create a new dataset
tea <- select(tea, all_of(keep_columns))
# perform MCA
mca <- MCA(tea, graph = F)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea, graph = F)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
# biplot
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali" )
From the plot, we can see that the majority of the categories are spread
along Dim 2, implying that Dim 2 is able to "stratify" most of the
questions in the tea dataset.
On the other hand, the categories unpackaged and
tea shop clearly sit on the right (positive)
side of Dim 1.
Drawing conclusions from this plot is not easy: what we can speculate is
that Dim 1 separates the people we could call "tea
connoisseurs": those who buy their tea at the tea shop usually know what
they want and buy it unpackaged. Conversely, "casual"
tea drinkers get their tea from a chain store and buy it in tea bags
(both these categories sit on the negative side of Dim 1).
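The same reading can be checked numerically from the MCA object, which stores how much each category contributes to each dimension (a sketch using the `mca` object fitted above):

```r
# per-category contributions (in %) to the first two MCA dimensions
head(round(mca$var$contrib[, 1:2], 1), 10)
```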
***
date()
## [1] "Mon Dec 11 16:31:27 2023"
# load the GGally and ggplot2 libraries
library(dplyr)
library(tidyr)
library(ggplot2)
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
First of all, let’s have a look at the data and give a brief description of the dataset.
# Read the RATS data
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt",
sep = "\t", header = T)
# Look at the (column) names of RATS
names(RATS)
## [1] "ID" "Group" "WD1" "WD8" "WD15" "WD22" "WD29" "WD36" "WD43"
## [10] "WD44" "WD50" "WD57" "WD64"
# Look at the structure of RATS
str(RATS)
## 'data.frame': 16 obs. of 13 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Group: int 1 1 1 1 1 1 1 1 2 2 ...
## $ WD1 : int 240 225 245 260 255 260 275 245 410 405 ...
## $ WD8 : int 250 230 250 255 260 265 275 255 415 420 ...
## $ WD15 : int 255 230 250 255 255 270 260 260 425 430 ...
## $ WD22 : int 260 232 255 265 270 275 270 268 428 440 ...
## $ WD29 : int 262 240 262 265 270 275 273 270 438 448 ...
## $ WD36 : int 258 240 265 268 273 277 274 265 443 460 ...
## $ WD43 : int 266 243 267 270 274 278 276 265 442 458 ...
## $ WD44 : int 266 244 267 272 273 278 271 267 446 464 ...
## $ WD50 : int 265 238 264 274 276 284 282 273 456 475 ...
## $ WD57 : int 272 247 268 273 278 279 281 274 468 484 ...
## $ WD64 : int 278 245 269 275 280 281 284 278 478 496 ...
# Print out summaries of the variables
summary(RATS)
## ID Group WD1 WD8 WD15
## Min. : 1.00 Min. :1.00 Min. :225.0 Min. :230.0 Min. :230.0
## 1st Qu.: 4.75 1st Qu.:1.00 1st Qu.:252.5 1st Qu.:255.0 1st Qu.:255.0
## Median : 8.50 Median :1.50 Median :340.0 Median :345.0 Median :347.5
## Mean : 8.50 Mean :1.75 Mean :365.9 Mean :369.1 Mean :372.5
## 3rd Qu.:12.25 3rd Qu.:2.25 3rd Qu.:480.0 3rd Qu.:476.2 3rd Qu.:486.2
## Max. :16.00 Max. :3.00 Max. :555.0 Max. :560.0 Max. :565.0
## WD22 WD29 WD36 WD43
## Min. :232.0 Min. :240.0 Min. :240.0 Min. :243.0
## 1st Qu.:267.2 1st Qu.:268.8 1st Qu.:267.2 1st Qu.:269.2
## Median :351.5 Median :356.5 Median :360.0 Median :360.0
## Mean :379.2 Mean :383.9 Mean :387.0 Mean :386.0
## 3rd Qu.:492.5 3rd Qu.:497.8 3rd Qu.:504.2 3rd Qu.:501.0
## Max. :580.0 Max. :590.0 Max. :597.0 Max. :595.0
## WD44 WD50 WD57 WD64
## Min. :244.0 Min. :238.0 Min. :247.0 Min. :245.0
## 1st Qu.:270.0 1st Qu.:273.8 1st Qu.:273.8 1st Qu.:278.0
## Median :362.0 Median :370.0 Median :373.5 Median :378.0
## Mean :388.3 Mean :394.6 Mean :398.6 Mean :404.1
## 3rd Qu.:510.5 3rd Qu.:516.0 3rd Qu.:524.5 3rd Qu.:530.8
## Max. :595.0 Max. :612.0 Max. :618.0 Max. :628.0
The RATS dataset is in the wide data form: we have 16 subjects, rats, split into three groups, and each rat's weight is recorded repeatedly over the study. The time points are WD1, WD8, WD15, WD22, WD29, WD36, WD43, WD44, WD50, WD57 and WD64.
# convert the ID and Group columns to factors
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
# convert to longitudinal form
RATS_L <- pivot_longer(RATS, cols = -c(ID, Group), names_to = "WD",
values_to = "Weight") %>%
mutate(Time = as.integer(substr(WD, 3,4))) %>%
arrange(Time)
dim(RATS_L)
## [1] 176 5
glimpse(RATS_L)
## Rows: 176
## Columns: 5
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", …
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …
Now the dataset is in the long form. In the wide form we had one row per rat, with one column per measurement time point. In the long form we have one row per rat and time point combination (16 × 11 = 176 rows): each row now holds one rat's weight at one time point.
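To double-check that the reshape lost nothing, the long data can be pivoted back to the wide form (a quick sketch with tidyr):

```r
# pivot back to the wide form: one row per rat, one column per weighing day
RATS_wide <- pivot_wider(RATS_L, id_cols = c(ID, Group),
                         names_from = WD, values_from = Weight)
dim(RATS_wide)  # should again be 16 rows and 13 columns
```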
# draw the plot
ggplot(RATS_L, aes(x = Time, y = Weight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATS_L$Weight), max(RATS_L$Weight)))
From this first plot, we can already see that the weight of
almost all the rats tends to increase during the treatment. We can also
observe the tracking phenomenon: rats that were heavier at the
beginning stay heavier throughout the study.
However, the non-standardized nature of the data makes drawing
conclusions hard.
# standardise the variable Weight
RATS_L_st <- RATS_L %>%
group_by(Time) %>%
mutate(Weight_st = (Weight - mean(Weight))/sd(Weight)) %>%
ungroup()
# look at the data
glimpse(RATS_L_st)
## Rows: 176
## Columns: 6
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2,…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, …
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1…
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, …
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, …
## $ Weight_st <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.8819001, …
# plot again with the standardised Weight
ggplot(RATS_L_st, aes(x = Time, y = Weight_st, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(name = "standardized weight")
# summary data with mean and standard error of Weight by group and time
RATS_L_st_se <- RATS_L_st %>%
group_by(Group, Time) %>%
summarise( mean = mean(Weight), se = sd(Weight)/sqrt(n()) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
glimpse(RATS_L_st_se)
## Rows: 33
## Columns: 4
# plot the mean profiles
ggplot(RATS_L_st_se, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
theme(legend.position = c(0.9,0.5)) +
scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
This last plot shows the average profile of each rat group, with the mean and its standard error at each time point. We can see that the group means do not overlap at any time point.
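The error bars in the plot use the standard error of the mean, \(sd/\sqrt{n}\). As a quick base-R sanity check, here it is computed for the Group 1 weights at the first time point (values taken from the glimpse output above):

```r
x <- c(240, 225, 245, 260, 255, 260, 275, 245)  # Group 1 weights, Time 1
se <- sd(x) / sqrt(length(x))  # standard error of the mean
round(se, 2)  # 5.38
```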
# create a summary data by Group and ID with mean as the summary variable
# (ignoring baseline Time 1)
RATS_S <- RATS_L %>%
filter(Time > 1) %>%
group_by(Group, ID) %>%
summarise( mean=mean(Weight) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
glimpse(RATS_S)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9, 4…
# draw a boxplot of the mean versus Group
ggplot(RATS_S, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight), Days 8-64")
The boxplots reflect quite well what we observed in the previous plots. They also show that Group 3 has the smallest variability. Each group contains one outlier: we can try to remove them.
# Create new data by filtering out the outliers, then draw the plot again with the new data
RATS_S_clean <- RATS_S[!(RATS_S$Group == 3 & RATS_S$mean < 500),] %>%
subset(mean < 590) %>% subset(mean > 240)
# plot again
ggplot(RATS_S_clean, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight), Days 8-64")
Since the t-test requires exactly two levels (groups) and we have three, it can’t be applied here. Let’s use ANOVA instead.
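For reference, a one-way ANOVA with three groups looks like this in base R; the data here are simulated, with group means chosen roughly in the range of the rat weights:

```r
set.seed(1)
# simulated summary measures for three groups of five rats each
g <- factor(rep(1:3, each = 5))
y <- c(rnorm(5, mean = 260, sd = 10),
       rnorm(5, mean = 450, sd = 10),
       rnorm(5, mean = 530, sd = 10))
fit <- aov(y ~ g)
summary(fit)       # F-test of the null hypothesis that all group means are equal
anova(lm(y ~ g))   # the equivalent ANOVA table via lm()
```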
# Add the baseline (WD1) from the original wide data as a new variable to the summary data
RATS_S <- RATS_S %>%
mutate(baseline = RATS$WD1)
# Fit the linear model with the mean as the response
fit1 <- lm(mean ~ baseline + Group, data = RATS_S)
summary(fit1)
##
## Call:
## lm(formula = mean ~ baseline + Group, data = RATS_S)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.905 -4.194 2.190 7.577 14.800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.16375 21.87657 1.516 0.1554
## baseline 0.92513 0.08572 10.793 1.56e-07 ***
## Group2 34.85753 18.82308 1.852 0.0888 .
## Group3 23.67526 23.25324 1.018 0.3287
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.68 on 12 degrees of freedom
## Multiple R-squared: 0.9936, Adjusted R-squared: 0.992
## F-statistic: 622.1 on 3 and 12 DF, p-value: 1.989e-13
# Compute the analysis of variance table for the fitted model with anova()
anova(fit1)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 253625 253625 1859.8201 1.57e-14 ***
## Group 2 879 439 3.2219 0.07586 .
## Residuals 12 1636 136
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Looking at the model and at the ANOVA table, once the baseline is taken into account the groups no longer differ significantly. In other words, the baseline explains most of the variability between the groups.
First of all, let’s have a look at the data and give a brief description of the dataset.
# Read the BPRS data
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt",
sep = " ", header = T)
# Look at the (column) names of BPRS
names(BPRS)
## [1] "treatment" "subject" "week0" "week1" "week2" "week3"
## [7] "week4" "week5" "week6" "week7" "week8"
# Look at the structure of BPRS
str(BPRS)
## 'data.frame': 40 obs. of 11 variables:
## $ treatment: int 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : int 1 2 3 4 5 6 7 8 9 10 ...
## $ week0 : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week1 : int 36 68 55 77 75 43 61 36 43 51 ...
## $ week2 : int 36 61 41 49 72 41 47 38 39 51 ...
## $ week3 : int 43 55 38 54 65 38 30 38 35 55 ...
## $ week4 : int 41 43 43 56 50 36 27 31 28 53 ...
## $ week5 : int 40 34 28 50 39 29 40 26 22 43 ...
## $ week6 : int 38 28 29 47 32 33 30 26 20 43 ...
## $ week7 : int 47 28 25 42 38 27 31 25 23 39 ...
## $ week8 : int 51 28 24 46 32 25 31 24 21 32 ...
# Print out summaries of the variables
summary(BPRS)
## treatment subject week0 week1 week2
## Min. :1.0 Min. : 1.00 Min. :24.00 Min. :23.00 Min. :26.0
## 1st Qu.:1.0 1st Qu.: 5.75 1st Qu.:38.00 1st Qu.:35.00 1st Qu.:32.0
## Median :1.5 Median :10.50 Median :46.00 Median :41.00 Median :38.0
## Mean :1.5 Mean :10.50 Mean :48.00 Mean :46.33 Mean :41.7
## 3rd Qu.:2.0 3rd Qu.:15.25 3rd Qu.:58.25 3rd Qu.:54.25 3rd Qu.:49.0
## Max. :2.0 Max. :20.00 Max. :78.00 Max. :95.00 Max. :75.0
## week3 week4 week5 week6 week7
## Min. :24.00 Min. :20.00 Min. :20.00 Min. :19.00 Min. :18.0
## 1st Qu.:29.75 1st Qu.:28.00 1st Qu.:26.00 1st Qu.:22.75 1st Qu.:23.0
## Median :36.50 Median :34.50 Median :30.50 Median :28.50 Median :30.0
## Mean :39.15 Mean :36.35 Mean :32.55 Mean :31.23 Mean :32.2
## 3rd Qu.:44.50 3rd Qu.:43.00 3rd Qu.:38.00 3rd Qu.:37.00 3rd Qu.:38.0
## Max. :76.00 Max. :66.00 Max. :64.00 Max. :64.00 Max. :62.0
## week8
## Min. :20.00
## 1st Qu.:22.75
## Median :28.00
## Mean :31.43
## 3rd Qu.:35.25
## Max. :75.00
The BPRS dataset is in the wide form: we have 40 male subjects split into two treatment groups, and a BPRS measurement is taken for each man every week, from week0 (the baseline measurement) to week8.
# convert categorical variables to factors
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
# convert to longitudinal form
BPRS_L <- pivot_longer(BPRS, cols= -c(treatment,subject), names_to = "weeks",
values_to = "bprs") %>% arrange(weeks)
BPRS_L <- BPRS_L %>% mutate(week = as.integer(substr(weeks,5,5)))
Now the dataset is in the long form. In the wide form we had one line per man, with one column for each weekly measurement. In the long form we have one line per man × time point (40 × 9 = 360): each line is one man, one week, and the bprs score measured that week.
# plot the BPRS_L data
ggplot(BPRS_L, aes(x = week, y = bprs, group = interaction(subject, treatment))) +
geom_line(aes(linetype = treatment )) +
scale_y_continuous(name = "bprs") +
theme(legend.position = "top")
Looking at the plot, there seems to be no difference between the effects of the two treatments, since the lines from the two groups overlap to a large extent.
# create a linear regression model for the BPRS_L data
fit_BPRS <- lm(bprs ~ week + treatment, data = BPRS_L)
# print out a summary of the model
summary(fit_BPRS)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRS_L)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
From the fitted linear model we can see that time (the week variable) is highly significant in the regression, while treatment2 does not look important. However, the independence assumption is violated by the repeated measures, so we need a random intercept model.
As explained before, the linear model assumes independence of the repeated measures of bprs, but it’s not the case: we need to fit a random intercept model for the same two explanatory variables, week and treatment. This will allow the linear regression fit for each man to differ in intercept from other men.
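In formula terms, the random intercept model can be sketched as follows, with \(b_i\) the subject-specific deviation from the common intercept:

\[
\text{bprs}_{ij} = \beta_0 + b_i + \beta_1\,\text{week}_{ij} + \beta_2\,\text{treatment2}_i + \varepsilon_{ij},
\qquad b_i \sim N(0,\sigma_b^2),\quad \varepsilon_{ij} \sim N(0,\sigma^2)
\]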
library(lmerTest)
##
## Attaching package: 'lmerTest'
## The following object is masked from 'package:lme4':
##
## lmer
## The following object is masked from 'package:stats':
##
## step
# Create a random intercept model
fit_BPRS_2 <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRS_L,
REML = FALSE)
# Print the summary of the model
summary(fit_BPRS_2)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
## method [lmerModLmerTest]
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRS_L
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 46.4539 1.9090 37.2392 24.334 <2e-16 ***
## week -2.2704 0.2084 340.0000 -10.896 <2e-16 ***
## treatment2 0.5722 1.0761 340.0000 0.532 0.595
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
From the random intercept model we can have a look at the random
effects. The subject effect is clearly important: the estimated
between-subject variance (47.41) is substantial relative to the
residual variance (104.21), which confirms that repeated measurements
on the same man are correlated. By fitting this model, which takes into
account the randomness coming from the subjects, the standard errors of
the fixed effects are corrected (compare them with those from the plain
linear model), and we can trust these estimates.
Now we can move on to fit the random intercept and random slope model. Fitting a random intercept and random slope model allows the linear regression fits for each individual to differ in intercept but also in slope.
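In the same notation as before, the random intercept and random slope model adds a subject-specific slope \(b_{1i}\) for week (sketch):

\[
\text{bprs}_{ij} = (\beta_0 + b_{0i}) + (\beta_1 + b_{1i})\,\text{week}_{ij} + \beta_2\,\text{treatment2}_i + \varepsilon_{ij}
\]

where \((b_{0i}, b_{1i})\) follow a bivariate normal distribution; their estimated correlation is the Corr column (−0.51) in the random effects output below.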
fit_BPRS_3 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRS_L,
REML = FALSE)
# print a summary of the model
summary(fit_BPRS_3)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
## method [lmerModLmerTest]
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRS_L
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9802 -0.51
## Residual 97.4305 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 46.4539 2.1052 22.6795 22.066 < 2e-16 ***
## week -2.2704 0.2977 19.9991 -7.626 2.42e-07 ***
## treatment2 0.5722 1.0405 320.0005 0.550 0.583
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
# perform an ANOVA test on the two models
anova(fit_BPRS_3, fit_BPRS_2)
## Data: BPRS_L
## Models:
## fit_BPRS_2: bprs ~ week + treatment + (1 | subject)
## fit_BPRS_3: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## fit_BPRS_2 5 2748.7 2768.1 -1369.4 2738.7
## fit_BPRS_3 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Adding a random slope for week improves the model (the ANOVA comparison gives a p-value < 0.05). This means that week is important as a random effect too: the rate of change over time varies between subjects.
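The χ² statistic that `anova()` reports is simply the difference in deviances of the two models, tested on degrees of freedom equal to the difference in parameter counts. We can check the arithmetic by hand (numbers copied from the model summaries above, so the result matches the reported 7.2721 only up to rounding):

```r
dev_intercept <- 2738.7; npar_intercept <- 5   # random intercept model
dev_slope     <- 2731.4; npar_slope     <- 7   # random intercept + slope model
chisq <- dev_intercept - dev_slope             # 7.3, ~ the 7.2721 reported
df    <- npar_slope - npar_intercept           # 2
pchisq(chisq, df, lower.tail = FALSE)          # ~ 0.026, matching Pr(>Chisq)
```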
# create a random intercept and random slope model with the interaction
fit_BPRS_interaction <- lmer(bprs ~ week * treatment + (week | subject), data = BPRS_L,
REML = FALSE)
# print a summary of the model
summary(fit_BPRS_interaction)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
## method [lmerModLmerTest]
## Formula: bprs ~ week * treatment + (week | subject)
## Data: BPRS_L
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0768 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.9964 8.0620
## week 0.9687 0.9842 -0.51
## Residual 96.4707 9.8220
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 47.8856 2.2521 29.6312 21.262 < 2e-16 ***
## week -2.6283 0.3589 41.7201 -7.323 5.24e-09 ***
## treatment2 -2.2911 1.9090 319.9977 -1.200 0.2310
## week:treatment2 0.7158 0.4010 319.9977 1.785 0.0752 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
# perform an ANOVA test on the two models
anova(fit_BPRS_interaction, fit_BPRS_3)
## Data: BPRS_L
## Models:
## fit_BPRS_3: bprs ~ week + treatment + (week | subject)
## fit_BPRS_interaction: bprs ~ week * treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## fit_BPRS_3 7 2745.4 2772.6 -1365.7 2731.4
## fit_BPRS_interaction 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# plot the observed BPRS_L data
ggplot(BPRS_L, aes(x = week, y = bprs, group = interaction(subject, treatment))) +
geom_line(aes(linetype = treatment )) +
scale_y_continuous(name = "bprs") +
theme(legend.position = "top")
# Create a vector of the fitted values and add to BPRS_L
Fitted <- fitted(fit_BPRS_interaction)
BPRS_L$fitted <- Fitted
# plot the fitted values of the interaction model
ggplot(BPRS_L, aes(x = week, y = fitted, group = interaction(subject, treatment))) +
geom_line(aes(linetype = treatment )) +
scale_y_continuous(name = "bprs") +
theme(legend.position = "top")
Finally, when we fit a model with the interaction between week and treatment, it is not significantly better than the previous one (ANOVA p-value > 0.05), so there is no evidence that the treatments differ in their effect over time.